Speaker diarization systems aim to find "who spoke when" in multi-speaker recordings. The datasets usually consist of meetings, TV/talk shows, telephone calls, and multi-party interaction recordings. In this paper, we propose a novel multimodal speaker diarization technique that finds the active speaker through an audio-visual synchronization model. A pre-trained audio-visual synchronization model is used to measure the synchronization between a visible person and the corresponding audio. For that purpose, short video segments comprising face-only regions are extracted using a face detection technique and are then fed to the pre-trained model. This model is a two-stream network that matches audio frames with their respective visual input segments. Based on the high-confidence video segments inferred by the model, the corresponding audio frames are used to train Gaussian mixture model (GMM)-based clusters. This method helps generate speaker-specific clusters with high probability. We tested our approach on a popular subset of the AMI meeting corpus consisting of 5.4 h of audio recordings and 5.8 h of a different set of multimodal recordings. A significant improvement in diarization error rate (DER) is observed with the proposed method compared to conventional and fully supervised audio-based speaker diarization. The results of the proposed technique are very close to those of complex state-of-the-art multimodal diarization systems, which shows the significance of such a simple yet effective technique.
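As a rough illustration of the clustering step described above, the following sketch (not the authors' implementation) seeds one GMM per visible speaker from audio frames whose audio-visual synchronization confidence is high, then labels the remaining frames by likelihood. Feature extraction, face tracking, and the sync model are assumed to exist upstream; all names, thresholds, and shapes below are illustrative placeholders.

```python
# Minimal sketch, assuming precomputed MFCC features, per-frame sync
# confidences, and visible-speaker labels from face tracking.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_gmms(mfcc_frames, sync_confidence, speaker_ids,
                       conf_threshold=0.8, n_components=16):
    """Fit one GMM per speaker on frames the sync model is confident about.

    mfcc_frames:     (n_frames, n_features) acoustic features
    sync_confidence: (n_frames,) audio-visual sync score per frame
    speaker_ids:     (n_frames,) visible-speaker label from face tracking
    """
    gmms = {}
    for spk in np.unique(speaker_ids):
        # Keep only this speaker's frames with high sync confidence.
        mask = (speaker_ids == spk) & (sync_confidence >= conf_threshold)
        if mask.sum() >= n_components:  # need enough frames to fit
            gmms[spk] = GaussianMixture(
                n_components=n_components,
                covariance_type="diag").fit(mfcc_frames[mask])
    return gmms

def assign_frames(mfcc_frames, gmms):
    """Label every frame with the speaker whose GMM scores it highest."""
    speakers = list(gmms.keys())
    log_likelihoods = np.stack(
        [gmms[s].score_samples(mfcc_frames) for s in speakers])
    return [speakers[i] for i in np.argmax(log_likelihoods, axis=0)]
```

In this reading of the abstract, the sync model supplies reliable speaker-labeled audio seeds, so the GMMs start from relatively pure per-speaker data rather than from unsupervised initialization.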